This is an interactive notebook. You can run it locally or use the links below.
## Prerequisites
Before you begin, install and import the required libraries, get your W&B API key, and initialize your Weave project, as in the sketch below.
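A minimal setup sketch, assuming the OpenAI SDK is used as the VLM client; the project name here is a placeholder, not the tutorial's:

```python
# Install the required libraries (run once in your environment):
#   pip install weave openai datasets

import weave

# Initializing Weave prompts for your W&B API key if it is not already configured.
# "handwritten-notes-ner" is a placeholder project name.
weave.init("handwritten-notes-ner")
```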
## 1. Create and iterate on prompts with Weave
Good prompt engineering is critical to guiding the model to properly extract entities. First, you'll create a basic prompt that gives the model instructions on what to extract from the image data and how to format it. Then, you'll store the prompt in Weave for tracking and iteration, as in the sketch below.
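As a hedged sketch, the prompt can be stored with `weave.StringPrompt` and `weave.publish`; the wording and entity keys here are illustrative, not the tutorial's verbatim prompt:

```python
import weave

# Illustrative system prompt; the tutorial's actual prompt may differ.
system_prompt = weave.StringPrompt(
    "You are an expert at reading handwritten notes. "
    "Extract the named entities from the image and return them as JSON "
    'with the keys "names", "dates", "locations", and "organizations". '
    "Use an empty list for any key with no matching entities."
)

# Publishing stores a versioned copy of the prompt in Weave,
# so you can track and iterate on it over time.
weave.publish(system_prompt, name="ner-prompt")
```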
## 2. Get the dataset
Next, retrieve the dataset of handwritten notes to serve as input for the OCR pipeline. The images in the dataset are already `base64`-encoded, which means the data can be used by the LLM without any pre-processing.
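What loading might look like, as a sketch only; the dataset identifier and the `image` column name are placeholders, not from the tutorial:

```python
from datasets import load_dataset

# Hypothetical dataset id; substitute the handwritten-notes dataset you are using.
dataset = load_dataset("your-org/handwritten-notes", split="train")

# Each row already carries a base64-encoded image string,
# so it can be sent to the VLM without pre-processing.
print(dataset[0]["image"][:80])
```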
## 3. Build the NER pipeline
Next, build the NER pipeline. The pipeline consists of two functions:
- An `encode_image` function that takes a PIL image from the dataset and returns a `base64`-encoded string representation of the image that can be passed to the VLM
- An `extract_named_entities_from_image` function that takes an image and a system prompt, and returns the entities extracted from that image as described by the system prompt
You will also create a function, `named_entity_recognition`, that:
- Passes the image data to the NER pipeline
- Returns correctly formatted JSON with the results
Each function is decorated with the `@weave.op()` decorator to automatically track and trace function execution in the W&B UI, as shown in the sketch below.
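Putting the pieces together, here is a minimal sketch of the pipeline, assuming the OpenAI client and `gpt-4o` as the VLM; the function bodies are illustrative rather than the tutorial's exact code:

```python
import base64
import io
import json

import weave
from openai import OpenAI

client = OpenAI()


@weave.op()
def encode_image(image) -> str:
    """Convert a PIL image into a base64-encoded PNG string for the VLM."""
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


@weave.op()
def extract_named_entities_from_image(image_b64: str, system_prompt: str) -> str:
    """Ask the VLM to extract entities from the image per the system prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # keep the output parseable
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    }
                ],
            },
        ],
    )
    return response.choices[0].message.content


@weave.op()
def named_entity_recognition(image, system_prompt: str) -> dict:
    """Run the full pipeline on one image and return the entities as a dict."""
    # Accept either a PIL image or an already base64-encoded string.
    image_b64 = image if isinstance(image, str) else encode_image(image)
    raw = extract_named_entities_from_image(image_b64, system_prompt)
    return json.loads(raw)
```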
Every time `named_entity_recognition` is run, the full trace results are visible in the Weave UI. To view the traces, navigate to the Traces tab of your Weave project.
The extracted entities are saved to a local file, `processing_results.json`, and the results are also viewable in the Weave UI.
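A sketch of the loop that produces that file, reusing the names from the sketches above; it assumes each dataset row exposes its image under an `image` key and that `.format()` renders the published `StringPrompt`:

```python
import json

results = []
for row in dataset:
    # Each call is traced in Weave because named_entity_recognition is a weave.op.
    entities = named_entity_recognition(row["image"], system_prompt.format())
    results.append(entities)

# Persist the extracted entities alongside the traces logged in Weave.
with open("processing_results.json", "w") as f:
    json.dump(results, f, indent=2)
```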
## 4. Evaluate the pipeline using Weave
Now that you have created a pipeline to perform NER using a VLM, you can use Weave to systematically evaluate it and find out how well it performs. You can learn more about Evaluations in Weave in Evaluations Overview. A fundamental part of a Weave Evaluation is the Scorer. Scorers are used to evaluate AI outputs and return evaluation metrics. They take the AI's output, analyze it, and return a dictionary of results. Scorers can use your input data as reference if needed, and can also output extra information, such as explanations or reasoning from the evaluation. In this section, you will create two Scorers to evaluate the pipeline:
- Programmatic scorer
- LLM-as-a-judge scorer
### Programmatic scorer
The programmatic scorer, `check_for_missing_fields_programatically`, takes the model output (the output of the `named_entity_recognition` function) and identifies which keys are missing or empty in the results. This check is useful for identifying samples where the model missed capturing one or more fields.
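A minimal sketch of such a scorer, assuming the output is a dict keyed by entity type; the expected key names mirror the illustrative prompt above and are not from the tutorial:

```python
import weave

# Illustrative entity keys, matching the sketch prompt above.
EXPECTED_KEYS = ["names", "dates", "locations", "organizations"]


@weave.op()
def check_for_missing_fields_programatically(output: dict) -> dict:
    """Report which expected keys are missing or empty in the model output."""
    missing = [
        key
        for key in EXPECTED_KEYS
        if key not in output or output[key] in (None, "", [])
    ]
    return {"missing_fields": missing, "all_fields_present": len(missing) == 0}
```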
### LLM-as-a-judge scorer
In the next step of the evaluation, both the image data and the model's output are provided to ensure the assessment reflects actual NER performance; the image content is explicitly referenced, not just the model output. The Scorer used for this step, `check_for_missing_fields_with_llm`, uses an LLM to perform the scoring (specifically, OpenAI's `gpt-4o`). As specified by the contents of the `eval_prompt`, `check_for_missing_fields_with_llm` outputs a Boolean value. If all fields match the information in the image and the formatting is correct, the Scorer returns `true`. If any field is missing, empty, incorrect, or mismatched, the result is `false`, and the scorer also returns a message explaining the problem.
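A hedged sketch of the judge, with an illustrative `eval_prompt`; the `image` parameter is assumed to match a base64-encoded dataset column of that name, and the tutorial's actual prompt and parsing may differ:

```python
import json

import weave
from openai import OpenAI

client = OpenAI()

# Illustrative judge prompt; the tutorial's eval_prompt may be worded differently.
eval_prompt = (
    "You are evaluating a named-entity-extraction result against the original "
    'image. Respond with JSON of the form {"correct": true or false, '
    '"reason": "..."}. Set "correct" to true only if every field matches the '
    "image and the formatting is correct."
)


@weave.op()
def check_for_missing_fields_with_llm(image: str, output: dict) -> dict:
    """Ask gpt-4o to compare the extracted entities to the image itself."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": eval_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": json.dumps(output)},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image}"},
                    },
                ],
            },
        ],
    )
    return json.loads(response.choices[0].message.content)
```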
## 5. Run the evaluation
Finally, define an evaluation call that automatically loops over the `dataset` passed to it and logs the results together in the Weave UI.
The following code kicks off the evaluation and applies the two Scorers to every output from the NER pipeline. Results are visible in the Evals tab in the Weave UI.
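A sketch of that call, reusing the names defined above; the thin `ner_model` wrapper is a hypothetical helper that binds the published prompt so the evaluation only needs to supply the `image` column:

```python
import asyncio

import weave


@weave.op()
def ner_model(image: str) -> dict:
    # Bind the prompt so dataset rows only need to provide the image.
    return named_entity_recognition(image, system_prompt.format())


evaluation = weave.Evaluation(
    dataset=list(dataset),  # rows as dicts; "image" feeds both model and judge
    scorers=[
        check_for_missing_fields_programatically,
        check_for_missing_fields_with_llm,
    ],
)

# Kick off the evaluation; results appear in the Evals tab of the Weave UI.
asyncio.run(evaluation.evaluate(ner_model))
```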